Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
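The GRN layer described above admits a compact implementation: aggregate a global L2 statistic per channel, normalize it across channels, and use it to recalibrate the features with a learnable affine term plus a residual connection. Below is a minimal PyTorch sketch assuming channels-last inputs (N, H, W, C) as used inside ConvNeXt blocks; details such as the epsilon and parameter initialization are assumptions and may differ from the released code.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization sketch: per-channel global aggregation,
    divisive normalization across channels, then learnable recalibration."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):                                         # x: (N, H, W, C), channels-last
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)         # global L2 energy per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)      # relative importance across channels
        return self.gamma * (x * nx) + self.beta + x              # calibrate features + residual
```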
Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships. However, despite some previous attempts at connecting these two related tasks with highly task-specific neural modules, how to explicitly model their shared nature and learn them simultaneously remains understudied. In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D enables learning a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. With a generic architecture design, UniT3D allows the pre-training scope to be expanded to a broader variety of training sources, such as data synthesized from 2D prior knowledge, to benefit 3D vision-language tasks. Extensive experiments and analysis demonstrate that UniT3D obtains significant gains for 3D dense captioning and visual grounding.
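The abstract does not spell out the exact architecture, so the following is only a hypothetical sketch of the general idea: one transformer encoder shared between 3D object-proposal tokens and text tokens, with a light grounding head that scores proposals and a captioning head that predicts the next word, trained under a joint objective. All module names, dimensions, and the toy training step are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedGroundCap(nn.Module):
    """Hypothetical unified transformer for 3D grounding + dense captioning:
    proposal tokens and text tokens share one encoder; two light heads on top."""
    def __init__(self, obj_feat_dim=128, d_model=256, vocab_size=3000):
        super().__init__()
        self.obj_proj = nn.Linear(obj_feat_dim, d_model)          # project 3D proposal features
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.ground_head = nn.Linear(d_model, 1)                  # per-proposal matching score
        self.caption_head = nn.Linear(d_model, vocab_size)        # next-token prediction

    def forward(self, obj_feats, text_tokens):
        obj = self.obj_proj(obj_feats)                            # (B, N_obj, d)
        txt = self.text_embed(text_tokens)                        # (B, N_txt, d)
        h = self.encoder(torch.cat([obj, txt], dim=1))
        n_obj = obj.size(1)
        return self.ground_head(h[:, :n_obj]).squeeze(-1), self.caption_head(h[:, n_obj:])

# Toy joint training step (shapes and targets are illustrative only).
model = UnifiedGroundCap()
obj_feats = torch.randn(2, 32, 128)                 # 32 proposals per scene
tokens = torch.randint(0, 3000, (2, 20))            # tokenized description
referred = torch.randint(0, 32, (2,))               # index of the described object
ground_logits, caption_logits = model(obj_feats, tokens)
loss = (F.cross_entropy(ground_logits, referred)
        + F.cross_entropy(caption_logits[:, :-1].reshape(-1, 3000), tokens[:, 1:].reshape(-1)))
loss.backward()
```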
Test-time training adapts to a new test distribution on the fly by optimizing a model for each test input using self-supervision. In this paper, we use masked autoencoders for this one-sample learning problem. Empirically, our simple method improves generalization on many visual benchmarks for distribution shifts. Theoretically, we characterize this improvement in terms of the bias-variance trade-off.
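A hedged sketch of the per-sample adaptation loop this describes: for each test input, a copy of the feature extractor is optimized for a few steps on a masked-reconstruction loss before the task head makes its prediction. The split into encoder/head/decoder, the masking helper, and the step count are assumptions rather than the paper's exact recipe.

```python
import copy
import torch

def test_time_adapt_and_predict(encoder, task_head, mae_decoder, x, mask_fn,
                                steps=10, lr=1e-3):
    """One-sample test-time training sketch: adapt a copy of the encoder on a
    masked-reconstruction loss for this single input, then predict with the task head.
    `mask_fn` is assumed to return (masked input, reconstruction target, binary mask)."""
    enc = copy.deepcopy(encoder)                        # leave the original weights untouched
    opt = torch.optim.SGD(enc.parameters(), lr=lr)
    for _ in range(steps):
        x_masked, target, mask = mask_fn(x)             # hide random patches of this one image
        recon = mae_decoder(enc(x_masked))
        loss = (((recon - target) ** 2) * mask).mean()  # reconstruct the hidden content
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return task_head(enc(x))                        # predict with the adapted features
```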
Object detection is a central downstream task used to test whether pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the main goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random-initialization baseline. Our results show that recent masking-based unsupervised learning methods can deliver convincing transfer learning improvements on COCO, increasing box AP by up to 4% (absolute) over supervised and prior self-supervised pre-training methods. Moreover, masking-based initializations scale better, with the improvement growing as model size increases.
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, in which the encoder operates only on the visible subset of patches (without mask tokens), together with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a non-trivial and meaningful self-supervised task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows learning high-capacity models that generalize well: for example, a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance on downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
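The two core designs read almost directly as code: drop a random 75% of patch tokens before the encoder, then let a lightweight decoder reconstruct all patches from the encoded visible tokens plus learned mask tokens, with the loss computed only on masked positions. The sketch below follows that description but is not the official implementation (positional embeddings and the ViT specifics are omitted).

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """MAE-style sketch: the encoder sees only visible patches (no mask tokens);
    a lightweight decoder reconstructs every patch from latents + mask tokens."""
    def __init__(self, patch_dim=768, d_enc=512, d_dec=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_enc)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_enc, nhead=8, batch_first=True), num_layers=4)
        self.to_dec = nn.Linear(d_enc, d_dec)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_dec))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_dec, nhead=4, batch_first=True), num_layers=1)
        self.pred = nn.Linear(d_dec, patch_dim)

    def forward(self, patches, mask_ratio=0.75):        # patches: (B, N, patch_dim)
        B, N, D = patches.shape
        n_keep = int(N * (1 - mask_ratio))
        ids_shuffle = torch.rand(B, N, device=patches.device).argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        ids_keep = ids_shuffle[:, :n_keep]
        visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(self.embed(visible))       # heavy encoder, ~25% of the tokens
        dec_in = torch.cat([self.to_dec(latent),
                            self.mask_token.expand(B, N - n_keep, -1)], dim=1)
        dec_in = torch.gather(                           # restore the original patch order
            dec_in, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dec_in.size(-1)))
        recon = self.pred(self.decoder(dec_in))
        mask = torch.ones(B, N, device=patches.device)
        mask.scatter_(1, ids_keep, 0.0)                  # 1 = masked, 0 = visible
        return (((recon - patches) ** 2).mean(-1) * mask).sum() / mask.sum()
```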
Non-contrastive methods of self-supervised learning (such as BYOL and SimSiam) learn representations by minimizing the distance between two views of the same image. These methods achieve remarkable performance in practice, but the theoretical understanding lags behind. Tian et al. (2021) explained why the representations do not collapse to zero, yet how the features are learned remains mysterious. In our work, we prove that, in a linear network, non-contrastive methods learn a desirable projection matrix and also reduce the sample complexity on downstream tasks. Our analysis shows that weight decay acts as an implicit threshold that discards features with high variance under data augmentation and keeps features with low variance. Inspired by our theory, we design a simpler and more efficient algorithm, DirectCopy, by removing the eigen-decomposition step from the original DirectPred algorithm of Tian et al. (2021). Our experiments show that DirectCopy rivals or even outperforms DirectPred on STL-10, CIFAR-10, CIFAR-100, and ImageNet.
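As a rough illustration of the "remove the eigen-decomposition" idea: DirectPred sets the linear predictor from an eigen-decomposition of the feature correlation matrix, whereas the simplification described here copies a running estimate of that correlation matrix into the predictor directly. The normalization, the ridge term, and the hyperparameters in this sketch are assumptions, not the paper's exact formulas.

```python
import torch
import torch.nn.functional as F

def directcopy_predictor(corr_ema, online_feats, momentum=0.99, mu=0.01):
    """Update a running estimate of the feature correlation matrix and copy it
    (normalized, with a small ridge term) into the linear predictor, instead of
    recomputing an eigen-decomposition as in DirectPred.  online_feats: (B, d)."""
    feats = online_feats.detach()                          # predictor is set without gradients
    batch_corr = feats.T @ feats / feats.size(0)
    corr_ema = momentum * corr_ema + (1 - momentum) * batch_corr
    w_pred = corr_ema / corr_ema.norm() + mu * torch.eye(corr_ema.size(0))
    return corr_ema, w_pred

def noncontrastive_loss(online_feats, target_feats, w_pred):
    """BYOL/SimSiam-style objective: predictor on the online branch,
    stop-gradient (detach) on the target branch."""
    pred = online_feats @ w_pred.T
    return -F.cosine_similarity(pred, target_feats.detach(), dim=-1).mean()
```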
This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failures, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.
Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our "SimSiam" method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.
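The essential ingredients above fit in a few lines: a shared encoder on two augmented views, a small predictor on one branch, negative cosine similarity as the loss, and a stop-gradient on the target branch. A minimal sketch, assuming `encoder` and `predictor` are provided modules:

```python
import torch.nn.functional as F

def simsiam_loss(encoder, predictor, x1, x2):
    """SimSiam-style objective (sketch): pull together two augmented views with
    negative cosine similarity; no negative pairs, no momentum encoder.
    The stop-gradient (detach) on the target branch is the essential ingredient."""
    z1, z2 = encoder(x1), encoder(x2)          # projections of the two views
    p1, p2 = predictor(z1), predictor(z2)      # predictions
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```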
Imitation learning (IL) is a simple and powerful way to use high-quality human driving data, which can be collected at scale, to identify driving preferences and produce human-like behavior. However, policies based on imitation learning alone often fail to sufficiently account for safety and reliability concerns. In this paper, we show how imitation learning combined with reinforcement learning using simple rewards can substantially improve the safety and reliability of driving policies over those learned from imitation alone. In particular, we use a combination of imitation and reinforcement learning to train a policy on over 100k miles of urban driving data, and measure its effectiveness in test scenarios grouped by different levels of collision risk. To our knowledge, this is the first application of a combined imitation and reinforcement learning approach in autonomous driving that utilizes large amounts of real-world human driving data.
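The abstract does not name the specific algorithm, so the following is only a generic, hypothetical sketch of combining an imitation term (match logged human actions) with a reinforcement term (prefer actions scored highly by a learned critic) in one weighted objective; all networks, dimensions, and weights are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder networks; state/action dimensions are illustrative only.
policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # state -> action
critic = nn.Sequential(nn.Linear(18, 64), nn.ReLU(), nn.Linear(64, 1))   # (state, action) -> value
# The critic is assumed to be trained separately on a simple reward (e.g., collision avoidance).

opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
bc_weight, rl_weight = 1.0, 0.1             # assumed trade-off between the two terms

states = torch.randn(32, 16)                # logged driving states
expert_actions = torch.randn(32, 2)         # human driver actions from the same logs

actions = policy(states)
bc_loss = ((actions - expert_actions) ** 2).mean()               # imitation: stay close to the humans
rl_loss = -critic(torch.cat([states, actions], dim=-1)).mean()   # RL: prefer high-value actions
(bc_weight * bc_loss + rl_weight * rl_loss).backward()
opt.step()                                                        # only the policy is updated here
```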
Backdoor attacks represent one of the major threats to machine learning models. Various efforts have been made to mitigate backdoors. However, existing defenses have become increasingly complex and often require high computational resources or may also jeopardize models' utility. In this work, we show that fine-tuning, one of the most common and easy-to-adopt machine learning training operations, can effectively remove backdoors from machine learning models while maintaining high model utility. Extensive experiments over three machine learning paradigms show that fine-tuning and our newly proposed super-fine-tuning achieve strong defense performance. Furthermore, we coin a new term, namely backdoor sequela, to measure the changes in model vulnerabilities to other attacks before and after the backdoor has been removed. Empirical evaluation shows that, compared to other defense methods, super-fine-tuning leaves limited backdoor sequela. We hope our results can help machine learning model owners better protect their models from backdoor threats. Also, it calls for the design of more advanced attacks in order to comprehensively assess machine learning models' backdoor vulnerabilities.
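As a plain illustration of the defense described: fine-tune the possibly backdoored model on clean, trusted data with an ordinary supervised loop. The "super-fine-tuning" variant adds a specific learning-rate strategy that the abstract does not detail, so it is not reproduced in this sketch.

```python
import torch
import torch.nn as nn

def finetune_defense(model, clean_loader, epochs=10, lr=0.01, device="cpu"):
    """Fine-tune a possibly backdoored model on clean, trusted data.
    Plain fine-tuning only; the paper's 'super-fine-tuning' additionally uses a
    dedicated learning-rate strategy that is not reproduced here."""
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in clean_loader:                # clean_loader yields (image, label) batches
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```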